The human brain contextually exploits heterogeneous sensory information to efficiently perform cognitive tasks including vision and hearing. For example, in a cocktail-party situation, the human auditory cortex contextually integrates audio-visual (AV) cues to better perceive speech. Recent studies have shown that AV speech enhancement (SE) models can significantly improve speech quality and intelligibility in very low signal-to-noise ratio (SNR) environments compared with audio-only SE models. However, despite significant research in the area of AV SE, the development of real-time processing models with low latency remains a formidable technical challenge. In this paper, we present a novel framework for low-latency, speaker-independent AV SE that can generalize to a range of visual and acoustic noises. In particular, a generative adversarial network (GAN) is proposed to address the practical issue of visual imperfections in AV SE. In addition, we propose a deep neural network based real-time AV SE model that takes into account the cleaned visual speech output from the GAN to deliver more robust SE. The proposed framework is evaluated on synthetic and real noisy AV corpora using objective speech quality and intelligibility metrics and subjective listening tests. Comparative simulation results show that our real-time AV SE framework outperforms state-of-the-art SE approaches, including recent DNN-based SE models.
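For concreteness, below is a minimal sketch of one common mask-based audio-visual fusion pattern (encode the audio and visual streams, concatenate them, predict a time-frequency mask). The layer sizes, feature dimensions, and fusion scheme are illustrative assumptions, not the paper's actual architecture, which additionally conditions on a GAN-cleaned visual stream.

```python
# Minimal sketch of a mask-based audio-visual speech-enhancement network.
# NOT the paper's architecture; sizes and the concat+LSTM fusion are assumptions.
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, visual_dim=128, hidden=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(n_freq, hidden), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, visual_feats):
        # noisy_mag:    (batch, frames, n_freq)     magnitude spectrogram
        # visual_feats: (batch, frames, visual_dim) per-frame lip embeddings
        a = self.audio_enc(noisy_mag)
        v = self.visual_enc(visual_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)          # time-frequency mask in [0, 1]
        return mask * noisy_mag               # enhanced magnitude estimate

model = AVMaskEstimator()
enhanced = model(torch.rand(2, 100, 257), torch.rand(2, 100, 128))
```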
Deep learning (DL) based speech enhancement approaches are generally optimized to minimize the distance between clean and enhanced speech features. These often lead to improved speech quality, but they lack generalization and may not deliver the desired speech intelligibility in real-world noisy situations. To address these challenges, researchers have explored intelligibility-oriented (I-O) loss functions and the integration of audio-visual (AV) information for more robust speech enhancement (SE). In this paper, we introduce DL-based I-O SE algorithms exploiting AV information, a novel and previously unexplored research direction. Specifically, we present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as the training cost function. To the best of our knowledge, this is the first work to exploit the integration of AV modalities with an I-O based loss function. Comparative experimental results show that our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions, in terms of standard objective evaluation metrics on unseen speakers and noises.
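As a rough illustration of an intelligibility-oriented training objective, the toy loss below maximizes the per-band correlation between clean and enhanced temporal envelopes, which is the core quantity behind STOI. It is a simplified, differentiable surrogate for illustration, not the paper's modified STOI cost function.

```python
# Toy correlation-based loss in the spirit of STOI (simplified surrogate).
import torch

def correlation_loss(enhanced, clean, eps=1e-8):
    # enhanced, clean: (batch, bands, frames) temporal band envelopes
    e = enhanced - enhanced.mean(dim=-1, keepdim=True)
    c = clean - clean.mean(dim=-1, keepdim=True)
    corr = (e * c).sum(-1) / (e.norm(dim=-1) * c.norm(dim=-1) + eps)
    return 1.0 - corr.mean()   # lower loss -> higher envelope correlation
```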
Visual language such as charts and plots is ubiquitous in the human world. Comprehending plots and charts requires strong reasoning skills. Prior state-of-the-art (SOTA) models require at least tens of thousands of training examples and their reasoning capabilities are still quite limited, especially on complex human-written queries. This paper presents the first one-shot solution to visual language reasoning. We decompose the challenge of visual language reasoning into two steps: (1) plot-to-text translation, and (2) reasoning over the translated text. The key to this method is a modality conversion module, named DePlot, which translates the image of a plot or chart to a linearized table. The output of DePlot can then be directly used to prompt a pretrained large language model (LLM), exploiting the few-shot reasoning capabilities of LLMs. To obtain DePlot, we standardize the plot-to-table task by establishing unified task formats and metrics, and train DePlot end-to-end on this task. DePlot can then be used off-the-shelf together with LLMs in a plug-and-play fashion. Compared with a SOTA model finetuned on more than 28k data points, DePlot+LLM with just one-shot prompting achieves a 24.0% improvement over the finetuned SOTA on human-written queries from the chart QA task.
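The plug-and-play pattern described here can be sketched as follows using the publicly released checkpoint on Hugging Face. The checkpoint name, the prompt wording, and the downstream `query_llm()` helper are assumptions to verify against the model card, not the authors' exact setup.

```python
# Sketch of the DePlot + LLM pipeline: chart image -> linearized table -> LLM prompt.
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

processor = Pix2StructProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

def plot_to_table(image_path: str) -> str:
    image = Image.open(image_path)
    inputs = processor(images=image,
                       text="Generate underlying data table of the figure below:",
                       return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=512)
    return processor.decode(out[0], skip_special_tokens=True)

def build_prompt(table: str, question: str) -> str:
    # Reasoning over the table is delegated entirely to the LLM.
    return (f"Read the table below and answer the question.\n\n{table}\n\n"
            f"Question: {question}\nAnswer:")

# answer = query_llm(build_prompt(plot_to_table("chart.png"),
#                                 "Which year had the highest sales?"))  # hypothetical LLM call
```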
Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning which are the key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by as much as nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.
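One way to picture the pretraining recipe is as a weighted mixture of task streams (chart derendering, math reasoning, and the original Pix2Struct screenshot-parsing objective) feeding a single image-to-text model. The sampling weights and placeholder examples below are illustrative assumptions, not the paper's actual mixture.

```python
import random

# Illustrative MatCha-style task mixture: each task yields (image, target_text) pairs.
def chart_derendering_example():
    # Ask the model to reproduce the table or plotting code behind a rendered chart.
    return ("<chart image>", "<linearized table or plotting code>")

def math_reasoning_example():
    # Render a textual math problem as an image and predict its answer.
    return ("<rendered math problem>", "<numeric answer>")

def screenshot_parsing_example():
    # Original Pix2Struct objective: predict simplified HTML for a masked screenshot.
    return ("<masked screenshot>", "<simplified HTML>")

def sample_pretraining_example(tasks, weights):
    task = random.choices(tasks, weights=weights, k=1)[0]
    return task()

example = sample_pretraining_example(
    [chart_derendering_example, math_reasoning_example, screenshot_parsing_example],
    weights=[0.4, 0.4, 0.2],  # assumed mixture ratios
)
```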
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
Through their transfer learning abilities, highly parameterized large pre-trained language models have dominated the NLP landscape for a multitude of downstream language tasks. Though linguistically proficient, the inability of these models to incorporate the learning of non-linguistic entities (numerals and arithmetic reasoning) limits their usage for tasks that require numeric comprehension or strict mathematical reasoning. However, as we illustrate in this paper, building a general-purpose language model that also happens to be proficient in mathematical reasoning is not as straightforward as training it on a numeric dataset. In this work, we develop a novel framework that enables language models to be mathematically proficient while retaining their linguistic prowess. Specifically, we offer information-theoretic interventions to overcome the catastrophic forgetting of linguistic skills that occurs while injecting non-linguistic skills into language models.
With the broad reach of the internet and smartphones, e-commerce platforms have an increasingly diverse user base. Since users of local languages are not familiar with English, their preferred mode of browsing is their regional language or a mix of their regional language and English. From our recent study of query data, we noticed that many of the queries we receive are code-mixed, specifically Hinglish, i.e., queries with one or more Hindi words written in English (Latin) script. We propose a transformer-based approach to code-mixed query translation that enables users to search with such queries. We demonstrate the effectiveness of pre-trained encoder-decoder models trained on a large corpus of unlabeled English text for this task. Using a generic-domain translation model, we create a pseudo-labeled dataset for training the model on search queries and verify the effectiveness of various data augmentation techniques. Furthermore, to reduce the latency of the model, we use knowledge distillation and weight quantization. The effectiveness of the approach has been validated through experimental evaluations and A/B testing. The model is currently live on the Flipkart app and website, serving millions of queries.
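The latency-reduction steps mentioned above (knowledge distillation and weight quantization) can be sketched generically as below; the temperature, the loss weighting, and the use of PyTorch dynamic quantization are assumptions for illustration, not the production setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (teacher -> student) with hard-label cross-entropy.
    T and alpha are illustrative hyperparameters."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Post-training dynamic quantization of the student's linear layers to int8:
# student_model = torch.quantization.quantize_dynamic(
#     student_model, {nn.Linear}, dtype=torch.qint8)
```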
With the democratization of e-commerce platforms, an increasingly diverse user base is choosing to shop online. To provide a comfortable and reliable shopping experience, it is important to let users interact with the platform in the language of their choice. Accurate query translation is essential for cross-lingual information retrieval (CLIR) with vernacular queries. Owing to internet-scale operations, e-commerce platforms receive millions of search queries every day. However, creating a parallel training set to train an in-domain translation model is cumbersome. This paper proposes an unsupervised domain adaptation approach to translate search queries without using any parallel corpus. We take an open-domain translation model (trained on public corpora) and adapt it to the query data using only monolingual queries from the two languages. In addition, fine-tuning with a small labeled set further improves the results. For demonstration, we show results for Hindi-to-English query translation, using the mBART-large-50 model as the baseline to improve upon. Experimental results show that, without using any parallel corpus, we obtain an improvement of more than 20 BLEU points over the baseline, while fine-tuning with a small 50K labeled set provides an improvement of more than 27 BLEU points over the baseline.
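One standard way to adapt a translation model without parallel data is back-translation with the open-domain model: translate monolingual queries in one language to the other to build a synthetic parallel set, then fine-tune on it. The sketch below uses the mbart-large-50-many-to-many-mmt checkpoint via transformers; whether the paper follows exactly this back-translation recipe is an assumption.

```python
# Synthetic-parallel-data creation via back-translation with mBART-50 (generic recipe).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

def back_translate_to_hindi(english_queries):
    """English query -> synthetic Hindi source; the resulting pairs can be used
    to fine-tune the Hindi->English direction on in-domain query text."""
    tokenizer.src_lang = "en_XX"
    batch = tokenizer(english_queries, return_tensors="pt", padding=True)
    generated = model.generate(
        **batch, forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"])
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

synthetic_hi = back_translate_to_hindi(["running shoes for men", "cotton saree under 500"])
```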
Optical imaging is commonly used for scientific and technological applications across industry and academia. In image sensing, a measurement, such as an object's position, is performed by computational analysis of a digitized image. An emerging image-sensing paradigm breaks this delineation between data collection and analysis by designing optical components that perform not imaging but encoding. By optically encoding images into a compressed, low-dimensional latent space suitable for efficient post-analysis, these image sensors can operate with fewer pixels and fewer photons, allowing higher-throughput, lower-latency operation. Optical neural networks (ONNs) offer a platform for processing data in the analog, optical domain. ONN-based sensors have, however, been limited to linear processing, yet nonlinearity is a prerequisite for depth, and multilayer NNs significantly outperform shallow NNs on many tasks. Here, we realize a multilayer preprocessor for image sensing by using a commercial image intensifier as a parallel optoelectronic, optical-to-optical nonlinear activation function. We demonstrate that the nonlinear ONN preprocessor can achieve compression ratios of up to 800:1 while still enabling high accuracy across several representative computer-vision tasks, including machine-vision benchmarks, flow-cytometry image classification, and identification of objects within real scenes. In all cases we find that the ONN's nonlinearity and depth allow it to outperform a purely linear ONN encoder. Although our experiments are specialized to ONN sensors for incoherent-light images, alternative ONN platforms should facilitate a range of ONN sensors. These ONN sensors may surpass conventional sensors by preprocessing optical information in spatial, temporal, and/or spectral dimensions, potentially with coherent and quantum qualities, all natively in the optical domain.
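A toy numerical analogue of the described sensor: a linear "optical" layer, a saturating optical-to-optical nonlinearity standing in for the image intensifier, a second linear layer compressing to a handful of output intensities, and a small digital readout acting on the compressed latent. All shapes, the saturation model, and the weights are illustrative assumptions, not the experimental system.

```python
import numpy as np

rng = np.random.default_rng(0)

def saturating_nonlinearity(x, sat=1.0):
    # Stand-in for the image intensifier's saturating optical-to-optical response.
    return sat * (1.0 - np.exp(-np.maximum(x, 0.0) / sat))

def onn_preprocessor(image, W1, W2):
    """Two-layer 'optical' encoder: linear mix -> saturating nonlinearity ->
    linear compression to a few readout intensities."""
    hidden = saturating_nonlinearity(W1 @ image.ravel())
    return W2 @ hidden                            # compressed latent (8 values)

# Toy example: a 64x64 intensity image compressed 512:1 down to 8 readout values.
image = rng.random((64, 64))
W1 = rng.random((256, 64 * 64)) / (64 * 64)       # non-negative "optical" weights
W2 = rng.random((8, 256)) / 256
latent = onn_preprocessor(image, W1, W2)

# A small digital classifier (here just a linear readout) then acts on `latent`.
digital_readout = rng.standard_normal((10, 8)) @ latent
print(latent.shape, digital_readout.shape)
```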
The neural boom that has sparked natural language processing (NLP) research over the past decade has likewise led to significant innovation in data-to-text generation (DTG). This survey offers a consolidated view of the neural DTG paradigm, with a structured examination of approaches, benchmark datasets, and evaluation protocols. It draws the boundaries that separate DTG from the rest of the natural language generation (NLG) landscape, provides an up-to-date synthesis of the literature, and highlights the stages of technological adoption within and beyond the larger NLG umbrella. With this holistic view, we highlight promising avenues for DTG research that focus not only on the design of linguistically capable systems but also on systems that exhibit fairness and accountability.